class: center, middle, inverse, title-slide .title[ # Survey Data Analysis with Kobocruncher ] .subtitle[ ## Session 6 - Cleaning and Indicator Calculation ] .author[ ###
Link to Documentation
–
Link to Previous Session
–
Link to Next Session
] .date[ ### Training Content as of 02 December 2022 ] --- ## When do yo need to clean the data? Survey data cleaning involves identifying and removing responses from individuals who either don’t match your target audience criteria or didn’t answer your questions thoughtfully. This filtering is done to avoid drawing misleading conclusions. Data cleaning remains a last resort option that can be at first minimized by: * .large[__Quality of questionnaire design__] not only to minimize social desirability and biased questions but also to ensure that the interview duration is limited (_ideally less than 45 minutes for a face to face interview and less than 25 minutes for a telephone interview_) * .large[__Good form encoding__] - with well defined [constraints](https://xlsform.org/en/#constraints) and [skip logic](https://xlsform.org/en/#relevant) and [requirement](https://xlsform.org/en/#required) to avoid Inconsistent Responses, sufficient testing to ensure that the questions are well understood and the responses options are covering well the options * .large[__Good training for the data enumerator__] and detailed [question hints](https://xlsform.org/en/#hints) so the enumerators fills correctly the questionnaire * .large[__Sufficient data collection quality monitoring__] to identify, prevent and cure issues early on. This can be done through [High Frequency Check](https://unhcr.github.io/HighFrequencyChecks/docs/). This should help to early on flag Straightlining / Patterned Responses if one enumerator uses for instance the same answer option ("B") over and over (as an example for at least five rows in a grid)... .bg-blue[ For data quality, prevention is lot more effective, quicker and cheaper than curing. Take the time to thoroughly test the questionnaire before starting full on data collection. ] ??? https://dimewiki.worldbank.org/Checklist:_Data_Cleaning https://dimewiki.worldbank.org/Data_Cleaning --- ## Cleaning is the most time consuming task: go through your initial exploration report to identify issues! In order to guide this selection phase, data experts, in collaboration with the data analysis group, can use the following elements: * For numeric value, check the .large[__frequency distributions__] of each variable to average, deviation, including outlier and oddities * For categorical variables, check for .large[__unexpected values__]: any weird results based on common sense expectations * Use cross-tabulation to verify potential .large[__illogical combination__] of answers (for instance "pregnant men"... ) * Use correlation analysis to check for potential .large[__contradictions__] in respondents answers to different questions for identified associations (chi-square) * Always, Check for .large[__missing data__] (NA) or "%of respondent who answered" (in the chart caption) that you cannot confidently explain * Check unanswered questions, that corresponds to .large[__unused skip logic__] in the questionnaire: For instance, did a person who was never displaced answer displacement-related questions? Were employment-related answers provided for a toddler? --- ### Variable cleaning Cleaning involves cleaning records (rows) and variable (columns). Cleaning variables is relevant for a few situations: * System variables ( precise date) and Section timestamps * Removing .large[__direct identifiers__] such as: * Name and surname * Document number (passport, national ID, driving license, etc.) * Address or other precise geographic information * GPS coordinates, if collected during face to face interviews. In some situation, it's possible to at least decrease the accuracy of the coordinates (removing digit).... * Telephone number .large[Fix them] by setting the concerned variable as `identifier` in the `anonymise` column the `xlsform`: see session [08-Anonymising](08-Anonymising.html) --- ## Situation when you will still need minimum record cleaning a priori Whatever is quality of form design, enumerator training and data collection quality monitoring, there will be still cases where cleaning will involve removing entire records: * Remove from the dataset records where no consent were obtained and/or more broadly one a specific filter/condition (where the respondent do not meet certain criteria or data from an unreliable enumerator identified during data collection quality monitoring... * Remove duplicate respondent ID based on the original sample list * Often survey includes nested tables (aka `repeat`), so if you remove records from the main table, you need to remove linked records in the nested tables kobo_remove will remove records based on a specific --- ## Clean based on time * Remove from the dataset records before or at specific dates .large[Fix them] by setting within the `xlsform`: * Starting date of the data collection in the `clean` column / `start` row * Ending date of the data collection in the `clean` column / `End` row * Remove from the dataset records when interview duration appears as outliers, either too long or too short, aka "speed responses" --- ## Situation when you will still need minimum cleaning a posteriori In other cases, cleaning will involve .large[`recoding`] some variables: 1. Recode un-explainable .large[__outliers for numerical questions__]. An example of this would be if you asked how much water one person use in a day and someone answered that they use 1000 liters, while the second largest usage reported is 150 liters. 2. Recode questions consecutive from .large[__"or other" choices__]. 3. Recode some questions answer as .large[__new calculated variables__] to have more balanced response categories based on frequency or potential closes meaning --- ## Outliers for numerical questions * Outliers: values significantly different from others * Outliers should be removed or modified only if they are (clearly) wrong values * Common outlier definition: observations three standard deviations from the mean .large[identify outliers] by looking at histograms, boxplots and scatter plots from the exploration report .large[Fix them] by setting the maximum accepted standard deviation for the variable in the `clean` column of the `xlsform` --- ## Recode categories, treat "or other" choices Often, categorical questions include an "or other" variables and this option might be mis-used In the exploratory report, those are plotted with a word cloud --- ## New calculated variables use case * filters on specific criteria * Create a filters on specific criteria * Ratio between 2 numeric variable * Calculation on date * Discretization of numeric variable according to quintile * Discretization of numeric variable according to fixed break * Aggregate variable from nested frame (aka within repeat) to parent table * filters on specific criteria --- ## kobo_indicator In kobocruncher, this is done with [kobo_indicator](https://edouard-legoupil.github.io/kobocruncher/reference/kobo_indicator.html) function. The function goes through steps: 1 - load the already defined indicators in the xlsform 2 - append new indicator supplied to the function if any, 3 - apply the indicator, i.e. do the calculation, 4 - re-save all the working indicator definition within the extended xlsform, dedicated indicator worksheet 5 - bind the new indicators in the dictionary for further plotting 6 - rebuild the plan if indicators are allocated to chapter, subchapter --- ## Create a filters on specific criteria ```r dplyr::if_else(datalist[1]$variable =="criteria", "yes","no") ``` --- ## Ratio between 2 numeric variable ```r datalist[1]$varnum1 / datalist[1]$varnum2 ``` --- ## Calculation on date ```r lubridate::interval( datalist[1]$datetocheck, lubridate::today()) %/% months(1) ``` --- ## Discretization of numeric variable according to quintile ```r Hmisc::cut2(datalist[1]$varnum, g =5) ``` --- ## Discretization of numeric variable according to fixed break for instance case size from integer to categoric ```r cut(datalist[1]$casesize, breaks = c(0, 1, 2, 3,5,30), labels = c("Case.size.1", "Case.size.2", "Case.size.3", "Case.size.4.5", "Case.size.6.or.more" ), include.lowest=TRUE) ``` --- ## Aggregate variable from nested frame (aka within repeat) to parent table ```r datalist[2] |> dplyr::select( members.sex, parent_index) |> tidyr::gather( parent_index, members.sex) |> dplyr::count(parent_index, members.sex) |> tidyr::spread(members.sex, n, fill = 0) |> dplyr::select( female) ``` --- class: inverse, center, middle # TIME TO PRACTISE ON YOUR OWN! ### .large[.white[
] **10 minutes! **]
−
+
10
:
00
Open again your expanded xlsfrom, set up the outliers treatment, clean the _"or_other"_ in the .large[clean] column. Then add calculated variables. Save and knit again your report! Do not hesitate to raise your questions in the [ticket system](https://github.com/Edouard-Legoupil/kobocruncher/issues/new) or in the chat so the training content can be improved accordingly! --- class: inverse, center, middle ### .large[.white[
] **Let's take a break! **]
−
+
05
:
00
__Next session__: [07-Weighting If the data was created through a probabilistic selection sampling approach, then we can apply weighting to the data before and regenerate the report so that those weights are reflected](07-Weighting.html)